# High-Precision Visual Understanding

Pixtral 12b Quantized.w8a8

An INT8-quantized version of mgoin/pixtral-12b that supports vision-text multimodal tasks with improved inference efficiency.

Transformers English

VARCO VISION 14B

VARCO-VISION-14B is an English-Korean Vision-Language Model (VLM) that takes image and text input, generates text output, and supports grounding, referring, and OCR.

Transformers Supports Multiple Languages

Xgen Mm Phi3 Mini Instruct Interleave R V1.5

xGen-MM is a series of foundational large multimodal models (LMMs) developed by Salesforce AI Research. It builds on the design of the BLIP series, with fundamental enhancements intended to provide a more robust foundation.

Safetensors English

Florence 2 Large Ft Moredetailed

A fine-tune of the Florence-2-large-ft model on the imageinwords dataset, focused on generating more detailed image descriptions.

Transformers English

Git Base Minecraft

This is a vision-based image-to-text model capable of generating image descriptions.

Image Generation

Transformers Supports Multiple Languages

© 2025 AIbase